Python String Encoding

Characters and Encoding

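Components of a character
- Byte sequence: the data actually stored on the computer. Each character is assigned a byte sequence.
- Glyph: the picture that is visible to the eye
  - http://www.asciitable.com/
  - http://www.kreativekorp.com/charset/encoding.php?name=CP949
- Code point: a number assigned to each character independently of any byte sequence (Unicode)
Encoding (scheme)
- The rule that assigns a byte sequence to each character
- The basic encoding is ASCII
Korean encodings
- euc-kr
- cp949
- utf-8
- References
  - http://d2.naver.com/helloworld/19187
  - http://d2.naver.com/helloworld/76650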
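
The difference between a code point and a byte sequence can be checked directly. The following is a minimal sketch, assuming Python 3 (where `str` holds code points and `bytes` holds byte sequences). The variable names are illustrative only.

```python
# Assumes Python 3: str holds code points, bytes holds byte sequences.
ch = "가"

# The code point is a single number, independent of any encoding.
code_point = ord(ch)

# The same character maps to a different byte sequence in each encoding.
encoded = {name: ch.encode(name) for name in ("utf-8", "cp949", "euc-kr")}

print(hex(code_point))
for name, data in encoded.items():
    print(name, data, len(data))
```

One code point yields three bytes in utf-8 but two bytes in cp949 and euc-kr, which is why the byte sequence alone does not identify a character unless the encoding is known.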

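Python 2 Strings

str type (the default)
- A byte string, encoded with whatever encoding the computer environment specifies
unicode type
- Stored internally as Unicode code points
- Use the encode / decode methods to convert to and from str (byte string)
In Python 3, the default string type is the Unicode one (what Python 2 calls unicode)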
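
For comparison, here is a small sketch of the same two roles in Python 3. It assumes Python 3 naming, where the Python 2 `unicode` type corresponds to `str` and the Python 2 `str` type corresponds to `bytes`. The variable names are hypothetical.

```python
# Assumes Python 3, where the two Python 2 types are renamed:
#   Python 2 unicode -> Python 3 str
#   Python 2 str     -> Python 3 bytes
text = "가"                   # text: a sequence of code points
data = text.encode("utf-8")   # bytes: an encoded byte sequence

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# decode() reverses encode() when the same encoding is used.
restored = data.decode("utf-8")
print(restored == text)
```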

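How Python Displays Strings

__repr__()
- What is shown when you simply type the variable name
- Also used when the string is an element of another object
- Characters that cannot be represented in the ASCII table are shown as escape sequences (for example \xea)
print() command
- Looks up an available glyph (font) and prints it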
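
The two display paths can be compared side by side. This is a sketch assuming Python 3, where the escaped form appears for `bytes` objects and the built-in `ascii()` gives an ASCII-only repr of text. The variable names are illustrative.

```python
# Assumes Python 3.
text = "가"
data = text.encode("utf-8")

shown_by_repr = repr(data)      # every non-ASCII byte is escaped
shown_in_list = repr([data])    # containers use repr() for their elements
ascii_only = ascii(text)        # repr() restricted to ASCII characters

print(shown_by_repr)
print(shown_in_list)
print(ascii_only)
print(text)                     # print() leaves glyph lookup to the terminal
```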



In [1]:

c = "a"
c

Out[1]:

'a'



In [2]:

print(c)

a

In [3]:

x = "가"
x

Out[3]:

'\xea\xb0\x80'



In [4]:

print(x)

가



In [5]:

print(x.__repr__())

'\xea\xb0\x80'



In [6]:

x = ["가"]
print(x)

['\xea\xb0\x80']



In [7]:

x = "가"
len(x)

Out[7]:

3



In [14]:

x = "ABC"
y = "가나다"
print(len(x), len(y))
print(x[0], x[1], x[2])
print(y[0], y[1], y[2])
print(y[0], y[1], y[2], y[3])

3 9
A B C
� � �
� � � �
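
The broken characters above appear because each index picks out a single byte of a multi-byte character. A minimal sketch of the same effect, assuming Python 3 and UTF-8 (three bytes per Hangul syllable), with illustrative variable names:

```python
# Assumes Python 3 and UTF-8, where each Hangul syllable takes three bytes.
text = "가나다"
data = text.encode("utf-8")

print(len(text), len(data))

# Slicing on a three-byte boundary yields one complete character.
first_char = data[0:3].decode("utf-8")

# A single byte of a multi-byte character cannot be decoded on its own.
try:
    data[0:1].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False

print(first_char, partial_ok)
```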

Unicode Literals

Prefixing the opening quotation mark with u makes Python treat the literal as a unicode string.
It is stored internally as Unicode code points.



In [9]:

y = u"가"
y

Out[9]:

u'\uac00'



In [10]:

print(y)

가



In [17]:

y = u"가나다"
print(y[0], y[1], y[2])

가 나 다
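
Because a unicode string is a sequence of code points, each index corresponds to one whole character. The sketch below assumes Python 3, which still accepts the `u` prefix. The variable names are illustrative.

```python
# Assumes Python 3; the u prefix is accepted but no longer required.
text = u"가나다"

# Each element is one character, and ord() gives its code point.
code_points = [hex(ord(ch)) for ch in text]

# The \u escape and chr() both build a character from its code point.
same = text[0] == "\uac00" == chr(0xAC00)

print(code_points)
print(text[1], same)
```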

Unicode Encoding / Decoding

encode
- A method of the unicode type
- unicode -> str (byte sequence)
decode
- A method of the str type
- str -> unicode



In [26]:

print(type(y))
z1 = y.encode("cp949")
print(type(z1))
print(z1)

<type 'unicode'>
<type 'str'>
������



In [27]:

print(type(y))
z2 = y.encode("utf-8")
print(type(z2))
print(z2)

<type 'unicode'>
<type 'str'>
가나다



In [28]:

print(type(z1))
y1 = z1.decode("cp949")
print(type(y1))
print(y1)

<type 'str'>
<type 'unicode'>
가나다



In [29]:

print(type(z2))
y2 = z2.decode("utf-8")
print(type(y2))
print(y2)

<type 'str'>
<type 'unicode'>
가나다
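
The cells above round-trip correctly because each byte string is decoded with the encoding that produced it. The following sketch, assuming Python 3, also shows what happens when the wrong encoding is used, which is the likely cause of the unreadable output when cp949 bytes are printed on a UTF-8 console. The variable names are illustrative.

```python
# Assumes Python 3.
text = "가나다"
as_cp949 = text.encode("cp949")
as_utf8 = text.encode("utf-8")

# Decoding with the matching encoding restores the original text.
round_trip_ok = (as_cp949.decode("cp949") == text
                 and as_utf8.decode("utf-8") == text)

# Decoding cp949 bytes as utf-8 fails in the default strict mode.
try:
    as_cp949.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
garbled = as_cp949.decode("utf-8", errors="replace")

print(len(as_cp949), len(as_utf8), round_trip_ok, strict_failed)
print(garbled)
```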

What happens if you call encode on a str, or decode on a unicode?



In [33]:

"가".encode("utf-8")

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-33-92305fa44153> in <module>()
----> 1 "가".encode("utf-8")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)



In [34]:

unicode("가", "ascii").encode("utf-8")

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-34-c07f24263d81> in <module>()
----> 1 unicode("가", "ascii").encode("utf-8")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)



In [37]:

u"가".decode("utf-8")

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-37-e1c95bb5b4e2> in <module>()
----> 1 u"가".decode("utf-8")

/home/joel/anaconda2/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode character u'\uac00' in position 0: ordinal not in range(128)



In [38]:

u"가".encode("ascii").decode("utf-8")

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-38-99ffe5d7c928> in <module>()
----> 1 u"가".encode("ascii").decode("utf-8")

UnicodeEncodeError: 'ascii' codec can't encode character u'\uac00' in position 0: ordinal not in range(128)

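Calling encode on a str:
- Python first tries to convert the str to unicode internally, using the default (ascii) codec, so any non-ASCII byte raises UnicodeDecodeError.
Calling decode on a unicode:
- Python assumes it was given a byte string, so it first encodes the unicode implicitly with the default (ascii) codec, and any non-ASCII character raises UnicodeEncodeError.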
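
The implicit steps described above can be written out explicitly. This is a sketch assuming Python 3, which removed the implicit conversions (`bytes` has no `encode` and `str` has no `decode`), so the hidden ASCII step of Python 2 is spelled out by hand here.

```python
# Assumes Python 3; the implicit ASCII steps of Python 2 are written explicitly.
byte_string = "가".encode("utf-8")   # the bytes Python 2 holds for "가" on a UTF-8 system
errors = []

# Python 2 "가".encode("utf-8") acts like decode("ascii") followed by encode("utf-8").
try:
    byte_string.decode("ascii").encode("utf-8")
except UnicodeDecodeError as exc:
    errors.append(type(exc).__name__)

# Python 2 u"가".decode("utf-8") acts like encode("ascii") followed by decode("utf-8").
try:
    "가".encode("ascii").decode("utf-8")
except UnicodeEncodeError as exc:
    errors.append(type(exc).__name__)

print(errors)
# In Python 3 the mismatched methods simply do not exist.
print(hasattr(bytes, "encode"), hasattr(str, "decode"))
```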

Default Encoding



In [39]:

u"가".encode("utf-8"), u"가".encode("cp949"), "가"

Out[39]:

('\xea\xb0\x80', '\xb0\xa1', '\xea\xb0\x80')



In [40]:

import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
import locale
print(locale.getpreferredencoding())

ascii
None
UTF-8
UTF-8
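
For comparison, the same checks can be run under Python 3. This is a sketch assuming Python 3, where the default encoding is utf-8 rather than ascii. The stream and locale values depend on the environment, so they are printed without any claim about their value.

```python
# Assumes Python 3.
import locale
import sys

default_encoding = sys.getdefaultencoding()      # fixed to utf-8 in Python 3
preferred = locale.getpreferredencoding(False)   # depends on the OS locale

print(default_encoding)
print(sys.stdout.encoding)   # depends on the console or notebook
print(preferred)
```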

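Encoding Settings

For console input
- If nothing is specified, Windows (Korean locale) uses CP949, while Linux and macOS follow the locale settings.
- It can be set with the environment variable PYTHONIOENCODING.

For file input
- Declare the encoding on the first line (the second line is also accepted) as follows: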
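
One way to see the effect of PYTHONIOENCODING is to start a child interpreter with the variable set and ask it which encoding its standard output uses. This is a sketch assuming Python 3.7 or later. The one-line child script is only for illustration.

```python
# Assumes Python 3.7+ (for capture_output).
import codecs
import os
import subprocess
import sys

# Copy the current environment and force the I/O encoding of the child.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env, capture_output=True, text=True, check=True,
)
reported = result.stdout.strip()

# Normalize spelling differences such as "UTF-8" and "utf-8".
normalized = codecs.lookup(reported).name
print(reported, normalized)
```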
```
#-*- coding: utf-8 -*-
```
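
The declaration can be read back with the standard library. This sketch assumes Python 3 and uses `tokenize.detect_encoding`, which looks for the same coding comment. The two-line source held in memory is a made-up example, not a real file.

```python
# Assumes Python 3; the source text below is a hypothetical example.
import io
import tokenize

# A tiny source file saved as cp949, with a matching coding declaration.
source = '# -*- coding: cp949 -*-\nname = "가"\n'.encode("cp949")

# detect_encoding() reads the first lines and returns the declared encoding.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Decoding with the declared encoding recovers the original text.
text = source.decode(declared)
print(declared)
print(text)
```

If the declaration and the encoding actually used to save the file disagree, non-ASCII literals in the file are decoded incorrectly or rejected, so the two should always match.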